dq tool
A Survey on Data Quality Dimensions and Tools for Machine Learning
Zhou, Yuhan, Tu, Fengjiao, Sha, Kewei, Ding, Junhua, Chen, Haihua
Machine learning (ML) technologies have become integral to practically all aspects of society, and data quality (DQ) is critical for the performance, fairness, robustness, safety, and scalability of ML models. With the large and complex data in data-centric AI, traditional methods like exploratory data analysis (EDA) and cross-validation (CV) face challenges, highlighting the importance of mastering DQ tools. In this survey, we review 17 DQ evaluation and improvement tools from the last 5 years. By introducing the DQ dimensions, metrics, and main functions embedded in these tools, we compare their strengths and limitations and propose a roadmap for developing open-source DQ tools for ML. Building on a discussion of the challenges and emerging trends, we further highlight the potential applications of large language models (LLMs) and generative AI in DQ evaluation and improvement for ML. We believe this comprehensive survey can enhance understanding of DQ in ML and drive progress in data-centric AI. A complete list of the literature investigated in this survey is available on GitHub at: https://github.com/haihua0913/awesome-dq4ml.
- North America > United States > Texas > Denton County > Denton (0.14)
- Asia > Singapore (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Health & Medicine (1.00)
- Information Technology > Security & Privacy (0.46)
Towards augmented data quality management: Automation of Data Quality Rule Definition in Data Warehouses
Tamm, Heidi Carolina, Nikiforova, Anastasija
In the contemporary data-driven landscape, ensuring data quality (DQ) is crucial for deriving actionable insights from vast data repositories. The objective of this study is to explore the potential for automating data quality management within data warehouses, a type of data repository commonly used by large organizations. Through a systematic review of existing DQ tools available on the market and in the academic literature, the study assesses their capability to automatically detect and enforce data quality rules. The review encompassed 151 tools from various sources and revealed that most current tools focus on data cleansing and fixing in domain-specific databases rather than in data warehouses. Only ten tools demonstrated the capability to detect DQ rules, let alone to do so in data warehouses. The findings underscore a significant gap in the market and in academic research regarding AI-augmented DQ rule detection in data warehouses. This paper advocates further development in this area to enhance the efficiency of DQ management processes, reduce human workload, and lower costs. The study highlights the need for advanced tools for automated DQ rule detection, paving the way for improved data quality management practices tailored to data warehouse environments. It can also guide organizations in selecting the data quality tool that best meets their requirements.
- Europe > Spain (0.04)
- Europe > Estonia > Tartu County > Tartu (0.04)
- Europe > Switzerland (0.04)
- Research Report > New Finding (1.00)
- Overview (1.00)
- Information Technology > Services (1.00)
- Banking & Finance (1.00)
- Information Technology > Software (0.93)
- Information Technology > Security & Privacy (0.93)